In this exercise, I hope you learn the following:
To demonstrate and achieve the above objectives, we will be using 15-min flow data from the EA. The gauging station is located on river soar at Littlethorpe.
Before starting if you don’t have R and Rstudio installed in your machine, follow the links below to download R and Studio (install R first before installing Rstudio)
It is a good practice to create an R project before starting any R related project, this allows you to:
In general,R can either read the data from an online or local source, but for the purpose of this demo, we will download flow data from this link and place it in the sub-folder data
Did you say how? Well, to install packages, two main options are available:
install.packages() for example install.packages("tidyverse")Ok! Now install the following packages needed for the rest of this demo:
install.packages("tidyverse") install.packages("lubridate") install.packages("dygraphs") install.packages("glm2") install.packages("xts")
Next let’s load these installed packages:
# load the necessary libraries
library(tidyverse)
library(lubridate)
library(dygraphs)
library(glm2)
library(xts)The data contains some rows (20) that we don’t need (info related to the gaging station) so we will skip them using the skip argument from read_csv() function to skip the first 20 rows.
# First list files in the data folder (we know in this case that we have just 1 file but just in case you have several files)
files <- list.files("./data")
files## [1] "Littlethorpe 4082 15 Minute Flow 1983 to 2011.csv"
# Read the file and use skip argument (notice that we use paste0 function to Concatenate 2 Strings)
flow <- read_csv(paste0("./data/", files[1]), skip = 20)The variable names are a bit messy and need some adjustments. let’s change them using the function colnames()
# Change variable names
colnames(flow) <- c("time_stamp","flow","quality","interp","tags","comments")
# print flow
flow## # A tibble: 952,870 x 6
## time_stamp flow quality interp tags comments
## <chr> <dbl> <chr> <int> <chr> <chr>
## 1 01/04/1983 09:00:00 1.18 G 102 <NA> <NA>
## 2 01/04/1983 09:15:00 1.17 G 102 <NA> <NA>
## 3 01/04/1983 09:30:00 1.16 G 102 <NA> <NA>
## 4 01/04/1983 09:45:00 1.15 G 102 <NA> <NA>
## 5 01/04/1983 10:00:00 1.15 G 102 <NA> <NA>
## 6 01/04/1983 10:15:00 1.15 G 102 <NA> <NA>
## 7 01/04/1983 10:30:00 1.15 G 102 <NA> <NA>
## 8 01/04/1983 10:45:00 1.15 G 102 <NA> <NA>
## 9 01/04/1983 11:00:00 1.16 G 102 <NA> <NA>
## 10 01/04/1983 11:15:00 1.16 G 102 <NA> <NA>
## # ... with 952,860 more rows
let’s check how the data look like and get some descriptive statistics
# get summary stat of data
summary(flow) ## time_stamp flow quality interp
## Length:952870 Min. : 0.023 Length:952870 Min. :102
## Class :character 1st Qu.: 0.394 Class :character 1st Qu.:102
## Mode :character Median : 0.648 Mode :character Median :102
## Mean : 1.263 Mean :102
## 3rd Qu.: 1.320 3rd Qu.:102
## Max. :26.200 Max. :102
## NA's :8
## tags comments
## Length:952870 Length:952870
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
# or
glimpse(flow) ## Observations: 952,870
## Variables: 6
## $ time_stamp <chr> "01/04/1983 09:00:00", "01/04/1983 09:15:00", "01/0...
## $ flow <dbl> 1.18, 1.17, 1.16, 1.15, 1.15, 1.15, 1.15, 1.15, 1.1...
## $ quality <chr> "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "...
## $ interp <int> 102, 102, 102, 102, 102, 102, 102, 102, 102, 102, 1...
## $ tags <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
# data dimensions
dim(flow) ## [1] 952870 6
# number of columns
ncol(flow)## [1] 6
# number of rows
nrow(flow)## [1] 952870
For example, we are going to check the different comments of the variable comment using unique() function to show unique elements
# Check the comments variable
unique(flow$comments) ## [1] NA
## [2] "manual edit."
## [3] "any spurious data may be the result of ice in the stilling well and/or channel"
From the summary results you can see that the class of the variable time_stamp is a character so it needs to be changed to datetime class. But before doing so, let’s try some examples of how you can use some functions from lubridate package to parse datetimes
## Examples of date time parsing:
my_brthdy <- "13/5/1984 13;20-00"
parsed_date <- dmy_hms(my_brthdy)
parsed_date## [1] "1984-05-13 13:20:00 UTC"
# extract year
year(parsed_date)## [1] 1984
# extract month
month(parsed_date)## [1] 5
# extract day
day(parsed_date)## [1] 13
# extract minutes
minute(parsed_date)## [1] 20
mutate verb:Here we are going overwrite the column time_stamp using mutate() and we use dmy_hms() to parse the time stamp.
# first thing to do is to parse date time
# in order to do that we are going to use functions dmy_hms() and the mutate() to overwrite the current date time variable
flow1 <- flow %>% mutate(time_stamp = dmy_hms(time_stamp))
flow1## # A tibble: 952,870 x 6
## time_stamp flow quality interp tags comments
## <dttm> <dbl> <chr> <int> <chr> <chr>
## 1 1983-04-01 09:00:00 1.18 G 102 <NA> <NA>
## 2 1983-04-01 09:15:00 1.17 G 102 <NA> <NA>
## 3 1983-04-01 09:30:00 1.16 G 102 <NA> <NA>
## 4 1983-04-01 09:45:00 1.15 G 102 <NA> <NA>
## 5 1983-04-01 10:00:00 1.15 G 102 <NA> <NA>
## 6 1983-04-01 10:15:00 1.15 G 102 <NA> <NA>
## 7 1983-04-01 10:30:00 1.15 G 102 <NA> <NA>
## 8 1983-04-01 10:45:00 1.15 G 102 <NA> <NA>
## 9 1983-04-01 11:00:00 1.16 G 102 <NA> <NA>
## 10 1983-04-01 11:15:00 1.16 G 102 <NA> <NA>
## # ... with 952,860 more rows
You can notice that the time_stamp variable class is changed to dttm
Select() verb to choose the relevant variablesFor example let’s say we wanted to keep just time_stamp,flow,quality,interp,tags variables. You can either write the variable number or the variable name
flow2 <- flow1 %>% select(time_stamp,flow,quality,interp,tags)
#print flow3
flow2## # A tibble: 952,870 x 5
## time_stamp flow quality interp tags
## <dttm> <dbl> <chr> <int> <chr>
## 1 1983-04-01 09:00:00 1.18 G 102 <NA>
## 2 1983-04-01 09:15:00 1.17 G 102 <NA>
## 3 1983-04-01 09:30:00 1.16 G 102 <NA>
## 4 1983-04-01 09:45:00 1.15 G 102 <NA>
## 5 1983-04-01 10:00:00 1.15 G 102 <NA>
## 6 1983-04-01 10:15:00 1.15 G 102 <NA>
## 7 1983-04-01 10:30:00 1.15 G 102 <NA>
## 8 1983-04-01 10:45:00 1.15 G 102 <NA>
## 9 1983-04-01 11:00:00 1.16 G 102 <NA>
## 10 1983-04-01 11:15:00 1.16 G 102 <NA>
## # ... with 952,860 more rows
Or we can just remove the the comments variable which give similar result.
flow1 %>% select(-comments) ## # A tibble: 952,870 x 5
## time_stamp flow quality interp tags
## <dttm> <dbl> <chr> <int> <chr>
## 1 1983-04-01 09:00:00 1.18 G 102 <NA>
## 2 1983-04-01 09:15:00 1.17 G 102 <NA>
## 3 1983-04-01 09:30:00 1.16 G 102 <NA>
## 4 1983-04-01 09:45:00 1.15 G 102 <NA>
## 5 1983-04-01 10:00:00 1.15 G 102 <NA>
## 6 1983-04-01 10:15:00 1.15 G 102 <NA>
## 7 1983-04-01 10:30:00 1.15 G 102 <NA>
## 8 1983-04-01 10:45:00 1.15 G 102 <NA>
## 9 1983-04-01 11:00:00 1.16 G 102 <NA>
## 10 1983-04-01 11:15:00 1.16 G 102 <NA>
## # ... with 952,860 more rows
# In order to subset the data for example annually we would need to separate the year, month, day, hour from time_stamp
flow3 <- flow2 %>% mutate(year = year(time_stamp),
month = month(time_stamp),
day = day(time_stamp),
hour = hour(time_stamp),
min = minute(time_stamp))
# Print flow3
flow3## # A tibble: 952,870 x 10
## time_stamp flow quality interp tags year month day hour
## <dttm> <dbl> <chr> <int> <chr> <dbl> <dbl> <int> <int>
## 1 1983-04-01 09:00:00 1.18 G 102 <NA> 1983 4.00 1 9
## 2 1983-04-01 09:15:00 1.17 G 102 <NA> 1983 4.00 1 9
## 3 1983-04-01 09:30:00 1.16 G 102 <NA> 1983 4.00 1 9
## 4 1983-04-01 09:45:00 1.15 G 102 <NA> 1983 4.00 1 9
## 5 1983-04-01 10:00:00 1.15 G 102 <NA> 1983 4.00 1 10
## 6 1983-04-01 10:15:00 1.15 G 102 <NA> 1983 4.00 1 10
## 7 1983-04-01 10:30:00 1.15 G 102 <NA> 1983 4.00 1 10
## 8 1983-04-01 10:45:00 1.15 G 102 <NA> 1983 4.00 1 10
## 9 1983-04-01 11:00:00 1.16 G 102 <NA> 1983 4.00 1 11
## 10 1983-04-01 11:15:00 1.16 G 102 <NA> 1983 4.00 1 11
## # ... with 952,860 more rows, and 1 more variable: min <int>
## check for NAs
# 1- check if yes or no there is missing data
anyNA(flow3$flow)## [1] TRUE
# 2- check how many missing values
sum(is.na(flow3$flow))## [1] 8
# 3- locate the missing values using the quality variable
missed <- flow3 %>% filter(quality == "M")
missed## # A tibble: 8 x 10
## time_stamp flow quality interp tags year month day hour
## <dttm> <dbl> <chr> <int> <chr> <dbl> <dbl> <int> <int>
## 1 1983-08-01 08:15:00 NA M 102 <NA> 1983 8.00 1 8
## 2 1984-04-01 08:45:00 NA M 102 <NA> 1984 4.00 1 8
## 3 1985-01-01 08:15:00 NA M 102 <NA> 1985 1.00 1 8
## 4 1985-04-01 08:45:00 NA M 102 <NA> 1985 4.00 1 8
## 5 1985-11-01 08:15:00 NA M 102 <NA> 1985 11.0 1 8
## 6 1985-12-01 08:45:00 NA M 102 <NA> 1985 12.0 1 8
## 7 1985-12-01 23:30:00 NA M 102 <NA> 1985 12.0 1 23
## 8 1986-07-01 23:45:00 NA M 102 <NA> 1986 7.00 1 23
## # ... with 1 more variable: min <int>
# or look for NAs
missed <- flow3 %>% filter(is.na(flow))
missed## # A tibble: 8 x 10
## time_stamp flow quality interp tags year month day hour
## <dttm> <dbl> <chr> <int> <chr> <dbl> <dbl> <int> <int>
## 1 1983-08-01 08:15:00 NA M 102 <NA> 1983 8.00 1 8
## 2 1984-04-01 08:45:00 NA M 102 <NA> 1984 4.00 1 8
## 3 1985-01-01 08:15:00 NA M 102 <NA> 1985 1.00 1 8
## 4 1985-04-01 08:45:00 NA M 102 <NA> 1985 4.00 1 8
## 5 1985-11-01 08:15:00 NA M 102 <NA> 1985 11.0 1 8
## 6 1985-12-01 08:45:00 NA M 102 <NA> 1985 12.0 1 8
## 7 1985-12-01 23:30:00 NA M 102 <NA> 1985 12.0 1 23
## 8 1986-07-01 23:45:00 NA M 102 <NA> 1986 7.00 1 23
## # ... with 1 more variable: min <int>
Now let’s check the number of records each year and see how complete our data is
#Calculate the number of records each year
records <- flow3 %>% group_by(year) %>% tally()
records## # A tibble: 29 x 2
## year n
## <dbl> <int>
## 1 1983 11710
## 2 1984 26365
## 3 1985 20637
## 4 1986 17569
## 5 1987 35040
## 6 1988 35136
## 7 1989 35040
## 8 1990 35040
## 9 1991 35040
## 10 1992 35136
## # ... with 19 more rows
You can see for example that the first 3 years are complete so you can choose to remove them
To illustrate the use of group_by() and summarize() we can for example aggregate the data to hourly time stamp or computing the annual maximum flow
# Aggregate the data to hourly
hflow <- flow3 %>% group_by(year,month,day,hour) %>%
summarize(flow = mean(flow, na.rm = TRUE),
datetime = first(time_stamp))
# print hflow
hflow## # A tibble: 238,219 x 6
## # Groups: year, month, day [?]
## year month day hour flow datetime
## <dbl> <dbl> <int> <int> <dbl> <dttm>
## 1 1983 4.00 1 9 1.16 1983-04-01 09:00:00
## 2 1983 4.00 1 10 1.15 1983-04-01 10:00:00
## 3 1983 4.00 1 11 1.17 1983-04-01 11:00:00
## 4 1983 4.00 1 12 1.17 1983-04-01 12:00:00
## 5 1983 4.00 1 13 1.17 1983-04-01 13:00:00
## 6 1983 4.00 1 14 1.19 1983-04-01 14:00:00
## 7 1983 4.00 1 15 1.18 1983-04-01 15:00:00
## 8 1983 4.00 1 16 1.18 1983-04-01 16:00:00
## 9 1983 4.00 1 17 1.17 1983-04-01 17:00:00
## 10 1983 4.00 1 18 1.17 1983-04-01 18:00:00
## # ... with 238,209 more rows
# Calculate the annual max flow
am_flow <- flow3 %>% group_by(year) %>%
summarize(am_flow = max(flow, na.rm = TRUE))
am_flow## # A tibble: 29 x 2
## year am_flow
## <dbl> <dbl>
## 1 1983 13.0
## 2 1984 15.7
## 3 1985 7.34
## 4 1986 9.55
## 5 1987 16.3
## 6 1988 23.5
## 7 1989 20.6
## 8 1990 17.5
## 9 1991 16.3
## 10 1992 16.6
## # ... with 19 more rows
# or for example Choosing the number of days where the flow exceeds the 99.5th percentile
# 1: find the Q99.5
Q995 <- quantile(hflow$flow, prob = 0.995, na.rm = T)
# 2: values exceeding the Q995
Q995_flow <- hflow %>% filter(flow >= Q995)
Q995_flow## # A tibble: 1,198 x 6
## # Groups: year, month, day [124]
## year month day hour flow datetime
## <dbl> <dbl> <int> <int> <dbl> <dttm>
## 1 1983 4.00 11 13 11.8 1983-04-11 13:00:00
## 2 1983 4.00 11 14 12.1 1983-04-11 14:00:00
## 3 1983 4.00 11 15 12.3 1983-04-11 15:00:00
## 4 1983 4.00 11 16 12.2 1983-04-11 16:00:00
## 5 1983 4.00 11 17 12.1 1983-04-11 17:00:00
## 6 1983 4.00 11 18 12.0 1983-04-11 18:00:00
## 7 1983 5.00 2 5 12.4 1983-05-02 05:00:00
## 8 1983 5.00 2 6 12.6 1983-05-02 06:00:00
## 9 1983 5.00 2 7 12.8 1983-05-02 07:00:00
## 10 1983 5.00 2 8 12.8 1983-05-02 08:00:00
## # ... with 1,188 more rows
# 3 find the number of hours each year
exceedance <- Q995_flow %>% group_by(year) %>%
summarise(events = n())
exceedance## # A tibble: 23 x 2
## year events
## <dbl> <int>
## 1 1983 16
## 2 1984 16
## 3 1987 49
## 4 1988 35
## 5 1989 74
## 6 1990 25
## 7 1991 18
## 8 1992 50
## 9 1993 123
## 10 1994 42
## # ... with 13 more rows
# select just years where no values exceed the threshold (this example with show also how you can )
non_exce <- hflow %>% ungroup(year, month, day) %>%
select(year) %>% distinct(year) %>%
filter(!year %in% exceedance$year)
# add 0 if no event
non_exce$events <- 0
exceedance <- full_join(exceedance,non_exce, by = c("year", "events")) %>% arrange(year)
exceedance## # A tibble: 29 x 2
## year events
## <dbl> <dbl>
## 1 1983 16.0
## 2 1984 16.0
## 3 1985 0
## 4 1986 0
## 5 1987 49.0
## 6 1988 35.0
## 7 1989 74.0
## 8 1990 25.0
## 9 1991 18.0
## 10 1992 50.0
## # ... with 19 more rows
Many useful R packages exist out there to visualise data but here we use one of the popular packages called ggplot2.
Let’s for example select and plot flow measurements of the year 2000:
#1- subset data year 2000
one_year <- hflow %>% filter(year == 2000 )
# line plot
ggplot(data = one_year, aes(x = datetime, y = flow))+
geom_line(color="blue")# histogram
ggplot(data = one_year, aes(x = flow))+
geom_histogram(fill = "blue")# box plot
ggplot(data = one_year, aes(x = datetime, y = flow, group = month))+
geom_boxplot()ggplot2 is a powerful graphic package and contains plenty of functions and arguments that allow adjusting/customizing graphics
Now let’s plot multiple graphics using the ggplot’s facet_wrap() function
# Facet wrap (for several plots)
# select for example 8 years of data
multi_yr <- hflow %>% filter(year %in% 2000:2008 )
# line graphs
ggplot(data = multi_yr, aes(x = datetime, y = flow))+
theme_bw()+ #simple theme
geom_line(color = "blue")+
facet_wrap(~year,scales = 'free_x')+
scale_x_datetime(date_labels = "%b")# histograms
ggplot(data = multi_yr, aes(x = flow))+
geom_histogram()+
facet_wrap(~year,scales = 'free_x')# Annual maximum flow + local polynomial regression fitting
ggplot(am_flow, aes(x = year, y = am_flow))+
geom_line()+
geom_smooth(method = "loess", se = T, span = 0.30,color = "blue")Or let’s try some interactive plot using dygraph package where you can zoom and hover over the graphic to see details.
# Interactive plots
hflow_ts <- as.xts(hflow,hflow$datetime)
hflow_ts[,5] %>% dygraph %>% dyRangeSelector()The models here are not necessarily the right models to use for these data but it’s just to illustrate how you can include model fits and other ggplot graphics.
#simple linear model
mod1 <- lm(events ~ year, data =exceedance)
#Poisson regression
mod2 <- glm2(events ~ year, data =exceedance,family="poisson")
#mod2 summary
summary(mod2)##
## Call:
## glm2(formula = events ~ year, family = "poisson", data = exceedance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.045 -7.830 -1.280 3.062 10.777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.882309 6.931474 -3.734 0.000188 ***
## year 0.014820 0.003469 4.272 1.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1174.6 on 28 degrees of freedom
## Residual deviance: 1156.2 on 27 degrees of freedom
## AIC: 1284.9
##
## Number of Fisher Scoring iterations: 5
pred_y <- fitted(mod2)
#plot all in one graphic
ggplot() +
theme_bw() +
geom_point(data = exceedance, aes(x = year, y = events))+
geom_line(data = exceedance, aes(x = year, y = events))+
geom_smooth(data = exceedance, aes(x = year, y = events),
method = "lm",color = "red")+
geom_smooth(data = exceedance, aes(x = year, y = events),
method = "loess",span = 0.30,se=T,color = "blue")+
geom_line(data = exceedance, aes(x = year, y = pred_y),
colour ="blue",size = 1)This section is under preparation